Resilience and fault tolerance are challenging tasks in the field of high
performance computing (HPC) and extreme scale systems. Components fail
more often in such systems, which results in application aborts. Adopting
fault-tolerance techniques makes it possible to detect failures consistently
and to continue the application's execution even in the presence of failures.
A prominent parallel programming specification, the message passing interface
(MPI), is used in this paper to implement a failure detection and consensus
algorithm. Although MPI does not facilitate fault tolerant behavior, this work
presents a fault tolerant, matrix based failure detection and consensus
algorithm. The proposed algorithm uses Gossiping. To detect failures,
randomised pinging is applied during the execution of the algorithm using
piggybacked gossip messages. To achieve consensus on the failures in the
system, information about failed processes is sent to all the alive processes
using the same piggybacked gossip messages. The algorithm was implemented in
the MPI framework and is completely fault tolerant. The results show that all
MPI process failures were detected using randomised pinging and that global
consensus was achieved on the failed MPI processes in the system.

Keywords: MPI, Parallelism, Scalability

This is an open access article under the CC BY-SA license. 
1. INTRODUCTION

Large-scale and extreme-scale systems are very often required to work under component failures.
A fault tolerance approach is essential for future extreme-scale systems, which fail more frequently due to
component (node and process) failures. As such systems continue to increase their component count, individual
component reliability decreases and software complexity increases [1]. To ensure parallel application
correctness and execution efficiency in large scale distributed systems, frequent failures must be
overcome. Otherwise, the complete application is aborted when a component fails. Adopting fault tolerant,
scalable failure detection and consensus approaches allows the application's execution to continue even in
the presence of failures.

Epidemic failure detection [2] is one of the foundations of fault tolerance in distributed systems.
Failure detection can take place through gossiping, known as gossip-based failure detection, in which each
process frequently announces its aliveness to its neighbour processes. With this notion, every process in the
system comes to know about every other process and decides whether a process is alive or failed. This
information is gradually disseminated through the network using the same Gossiping. In the field of
fault-tolerant computing, the consensus problem is the formation of an agreement on some value
among the fault-free processes in order to maintain the integrity and performance of the system [3]. The
notion of consensus (agreement) is to share information within a group of processes, ideally in a fault tolerant
way, i.e., the fault-free processes should be able to agree on and give accurate results.

The details of the consensus problem are illustrated in [3]-[5] (chapter 14, section 14.1.1). The paper
[6] contributes two state-of-the-art Gossip based algorithms that use randomized pinging to detect all
process failures quickly, both before and during the execution of the algorithms, within Gossip cycles.
Failures are circulated to each of the alive processes, and consensus on the failed processes is achieved
with the help of consensus algorithms. Implementation and testing are performed with an extreme scale
simulator. Consensus is detected correctly by all the alive processes when they identify the existence of
failed processes in the system.

Apart from failure detection and achieving consensus, disseminating and computing global information
is also a challenging task in large scale distributed systems [7]. Fault tolerant approaches are
appropriate for this task to avoid bottlenecks and failures. Several aggregation protocols have been developed
and are broadly classified as: (i) tree based protocols and (ii) epidemic protocols. Owing to their randomized
communication model, epidemic protocols are robust and scalable, and they support a larger number of
communications than tree based protocols. Epidemic protocols, which are inherently fault tolerant, have the
advantage of spreading information at high speed without additional communication overhead. Effective
failure detection in large distributed systems can be done by means of Gossip based protocols. In
[8], a Gossip based failure detection and consensus algorithm is examined to reduce resource utilization and
consensus time. Owing to its resilient nature, random gossiping among the nodes has been investigated for
detecting failures in the system. Moreover, Gossip protocols are able to scale with the process
count.

Ayiad and Fatta [9] proposed an epidemic consensus protocol to achieve global agreement
(consensus) among all the nodes from local computation using decentralised data aggregation. It is
composed of four phases: aggregation, convergence, agreement, and commit. For the epidemic consensus
protocol to function, two more protocols, a node cache protocol and a system size estimation
protocol, are used to achieve global agreement. The proposed work uses a single epidemic protocol to achieve
consensus on MPI process failures. Katti and Lilja [10] proposed a combined epidemic failure detection and
consensus algorithm that is suitable for very large scale systems. A PING REPLY mechanism is used
to detect process failures and reach consensus. The mechanism gossips the information with a single process
and spreads it to all alive processes using the same PING REPLY in the startup phase, growth phase,
shrink phase and final phase. The algorithm proposed in this paper is separated into four logical tasks: matrix
initialization, detecting failures, merging the fault matrices, and checking for consensus, all implemented with
MPI primitives.

In large scale distributed systems, reaching an agreement among the processes is a fundamental
need even in the presence of faulty processes. A new epidemic approach, an information dissemination
application [11], has been proposed that simulates the spreading of process information and achieves global
consensus in a decentralised fashion. Instead of using a separate application for disseminating MPI process
information, the proposed work uses the same Gossiping approach to spread process information.
Katti et al. [12] introduced three algorithms based on epidemic protocols for failure detection and consensus,
using the xSim (extreme scale) simulator. To facilitate consensus detection, the first algorithm
maintains an integer matrix at each process to store the status of all processes in the system. Thus, the
algorithm does not scale well.

This paper supplements the work in [12] and focuses primarily on the first algorithm, with the aim of
increasing its scalability. The scalability of the proposed algorithm is increased by maintaining the system
view at every individual process in the form of a boolean matrix, as shown in Figure 1. Consensus (global
agreement) is achieved with the help of alive processes using the same Gossiping approach, by preserving the
status of all processes in a matrix. Every process in the system maintains the status of the other processes in a
fault matrix, say F_m, which holds n × n elements, where n indicates the number of processes. An entry of 0
indicates that a process is alive; an entry of 1 indicates that it has failed. For instance, five processes, P0 to P4,
are shown in Figure 1. We can check for consensus at any one of the alive processes, say P2. Whether a
particular process is alive or failed, according to P2 and the other alive processes P0, P1, P3, and P4, is
determined by overlapping the row of P2 with the column of the process being checked, which is
shown separately in Figure 1. In this case, consensus on the failed process is detected by process P2. The
size of the fault matrix increases with the system size. The number of Gossip cycles needed to detect
consensus increases logarithmically with system size. The proposed algorithm is implemented and tested by
means of MPI point-to-point communication primitives.

Snir et al. [13] presented MPI, a programming model that increases performance, execution speed
and scalability. MPI is a standard that defines a set of library routines useful for
implementing portable and scalable parallel applications. It was designed by researchers from the software
industry and academia to build large-scale parallel applications. MPI allows users to create parallel programs
in C or Fortran 77 that run efficiently on parallel architectures. MPI is a standard programming paradigm
created by the MPI Forum and widely used on scalable parallel computers (SPCs). The major goals of MPI are
scalability, portability, reliable message transmission, added credibility for parallel computing, compatibility
with heterogeneous systems, and high performance. The features included in MPI are as
follows: point to point communication, process groups, process topologies, a profiling interface, collective
operations, communication domains, environmental management and inquiry, and bindings for the C and
Fortran 77 languages. In the present work, out of all the features available in MPI, point to point
communication primitives are used to implement the proposed algorithm.


[Figure 1: 5 × 5 boolean fault matrix for processes P0 to P4, with the row and column used for the consensus check shown separately]


Figure 1. Consensus detection on the failure of a process


Margolin and Barak [14] presented a tree based fault tolerant algorithm using collective operations
for parallel MPI applications, which detects failures and performs failure recovery as and when a failure
occurs. In [15], a user level failure mitigation (ULFM) specification is implemented in MPI to detect
failures without stopping the application's execution in exascale systems. To address the problem of studying
and comparing different MPI fault tolerance techniques that help in recovering from system failures, Guo et al.
[16] developed a benchmark suite, MATCH, which compares MPI fault tolerance designs for different
scenarios. Chakraborty et al. [17] implemented a global restart model, EREINIT, for bulk synchronous MPI
applications to decrease failure recovery time and improve scalability. Three basic mechanisms, failure
detection, failure recovery, and notification, were optimized and implemented in MPI. Hassani et al. [18]
designed a fault-aware MPI standard that implements fault-tolerant methods to confront failures.
MPICH, a portable implementation of MPI, was introduced for developing parallel applications [19]. The ULFM
interface is used to program fault tolerant MPI in a large molecular dynamics application (a case study) [20].
According to [21], MPI is the best tool for implementing parallelism in C, C++ and FORTRAN, as it has the
necessary standard libraries. Some studies have implemented MPI according to their research problems
[14]-[21]. We have also implemented a unique MPI-based algorithm for our identified problem statement.

The paper is organized as follows: section 2 provides details of the Gossip style matrix based failure
detection and consensus algorithms. Section 3 presents experimental results. Finally, section 4 concludes the
presented work.


2. METHOD 

Failure detection is essential for minimising the impact of failures and is a core component of any
infrastructure that requires resilience [22]. A failure detector is a distributed service capable of returning the
status of any process or node, whether alive or dead. It affects the overall performance of high performance
computing systems in terms of the latency to detect and propagate failures, and in terms of communication
and computation overhead. In general, any process or node can communicate with any other
process or node by sending messages over a communication channel, and each message takes at most a
bounded time to be delivered. If a process or node has failed, all its communication channels are emptied and
the failure is treated as permanent. Gossip approaches potentially incorporate random failure detection and
propagation times [23]. Processes and nodes randomly choose other processes and nodes with whom they
share their failure information using gossip style protocols, which are an alternative approach to
implementing scalable failure detectors. These protocols transmit information about all currently known
failures using ping and reply messages.


2.1. Randomised pinging for failure detection

This section discusses randomised pinging for failure detection as part of the epidemic protocol.
A process detects failures by periodically pinging other processes at random. During each Gossip cycle of
length T_Gossip, process p randomly selects a process q to ping. If process q replies before the end of the
ongoing T_Gossip cycle, process p detects process q as alive; otherwise, it detects q as failed. Figure 2 shows
the algorithm pseudo code. During a T_Gossip cycle, the number of ping messages a given process receives
(zero, one, or more) follows a binomial distribution. A failed process is therefore likely to be detected by one
or more alive processes, which then propagate the information so that consensus can be achieved. The
implemented matrix based failure detection and consensus algorithm can tolerate low-to-medium delays and
message losses; hence, it is completely fault tolerant.
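
The binomial claim can be made concrete under two simplifying assumptions that the text above does not state explicitly: all n - 1 other processes are alive, and each of them picks its ping target uniformly at random among the n - 1 processes other than itself. Writing X_q (a symbol introduced here for illustration) for the number of ping messages that process q receives in one cycle:

```latex
X_q \sim \mathrm{Binomial}\!\left(n-1,\ \tfrac{1}{n-1}\right), \qquad
P(X_q = k) = \binom{n-1}{k}\left(\frac{1}{n-1}\right)^{k}
             \left(1-\frac{1}{n-1}\right)^{n-1-k}

P(X_q = 0) = \left(1-\frac{1}{n-1}\right)^{n-1} \le e^{-1} \approx 0.37
```

Because the choices in different cycles are independent, a failed process goes unpinged for c consecutive cycles with probability at most e^{-c} under these assumptions, which is consistent with failures being detected within the first few cycles.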


At each process p

At every T_Gossip time (at each cycle)
    1  choose a random process q
    2  dispatch ping message to process q

At an event: a ping message received from q
    3  dispatch reply message to process q

At an event: reply not received from q before timeout
    4  write process q has failed


Figure 2. Failure detection using randomised pinging 
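
The pseudo code of Figure 2 can be tried out without an MPI installation. The following is a minimal sketch, not the paper's implementation: it simulates one gossip cycle in a single Python process, models the timeout as "a failed process never replies", and assumes that a process never pings itself. The function name `ping_cycle` and its parameters are hypothetical.

```python
import random

def ping_cycle(n, failed, rng):
    """Simulate one gossip cycle of randomised pinging.

    n      -- number of processes (ranks 0..n-1), assumed n >= 2
    failed -- set of ranks that have crashed and never reply
    rng    -- random.Random instance (seed it for repeatable runs)

    Returns {p: q} for every alive process p whose ping target q did not reply.
    """
    suspected = {}
    for p in range(n):
        if p in failed:
            continue                      # a failed process sends nothing
        # Assumption: p picks uniformly among the other n - 1 ranks.
        q = rng.choice([r for r in range(n) if r != p])
        replied = q not in failed         # timeout modelled as "no reply"
        if not replied:
            suspected[p] = q              # last step of Figure 2: p records q as failed
    return suspected

if __name__ == "__main__":
    print(ping_cycle(8, {3}, random.Random(0)))
```

Running the cycle repeatedly with the same `rng` shows how many cycles pass before some alive process first pings the failed rank.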


2.2. Achieving consensus using global knowledge 

This section discusses achieving consensus on failed MPI processes by making global consensus
knowledge available at every MPI process. Figure 3 shows the consensus algorithm, in which failures are
detected by pinging random processes. The status of all the MPI processes is maintained in a boolean matrix
F_i at each process i. An entry F_i[d, s] in the matrix denotes the status of process s as detected by process d.
The consensus algorithm is separated into four logical sections: i) initialisation, ii) detecting failures,
iii) updating the fault matrix and iv) checking for consensus.


2.2.1. Initialisation 
Lines 1-5 of Figure 3 assume that every MPI process in the system is alive: no MPI
process in the system has detected any failures yet.


2.2.2. Failure detection 

Lines 6 and 7 detect failures using randomised pinging: at every T_Gossip time, process i selects a
random process j and dispatches a ping message to process j, piggybacking its fault matrix F_i. At line 8, a
timeout event is created during the current T_Gossip cycle to await a reply message from process j.
Lines 20-22 indicate that a reply message is sent from process j, piggybacking its fault matrix F_j. At line
32, if process i has received no reply message from process j by the end of the current T_Gossip cycle,
process i (directly) detects process j as failed.


2.2.3. Fault matrix update 

The fault matrix is updated once process i receives a Gossip message (ping or reply) from
process q. Lines 23-27 and 29-31 update the local fault matrix F_i by carrying out a logical
OR operation between the corresponding elements of F_i and the received matrix F_q, excluding the ith row.
At line 28, an indirect local failure detection is performed by updating the row of process i in matrix F_i
(row i in F_i) to incorporate the detections of process q (row q in F_q). Sending the entire fault matrix F_i as
part of Gossiping propagates the detections of process i along with the detections of other processes known
to process i.
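
The merge rule can be written compactly. The sketch below follows lines 23-31 of Figure 3 using nested Python lists of booleans; the function name `merge_fault_matrix` and the list-of-lists representation are illustrative choices, not the paper's MPI code.

```python
def merge_fault_matrix(F_i, F_q, i, q):
    """Merge the piggybacked matrix F_q (sent by q) into the local matrix F_i of process i.

    F[d][s] is True when process d has detected process s as failed.
    The merge is done in place and F_i is also returned for convenience.
    """
    n = len(F_i)
    for s in range(n):
        for d in range(n):
            if d != i:
                # Remote detections: take over what q knows about row d.
                F_i[d][s] = F_i[d][s] or F_q[d][s]
            else:
                # Indirect local detection: i adopts q's own detections (row q).
                F_i[i][s] = F_i[i][s] or F_q[q][s]
    return F_i

if __name__ == "__main__":
    n = 3
    local = [[False] * n for _ in range(n)]
    received = [[False] * n for _ in range(n)]
    received[1][2] = True                 # process 1 has detected process 2 as failed
    print(merge_fault_matrix(local, received, i=0, q=1))
```

In the usage example, process 0 ends up with the failure of process 2 recorded both in row 1 (what process 1 saw) and in its own row 0 (indirect detection).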


2.2.4. Check for consensus 

Lastly, in lines 9-19, consensus on process s is checked at process i by carrying out a logical OR
operation between the corresponding elements of the sth column and the ith row of F_i. Consensus has thus
been reached when all alive processes in the system have detected a failed process.
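
As a concrete reading of lines 9-19 of Figure 3, the following sketch counts, for each candidate process s, the rows d that either report s as failed or are themselves known by process i to have failed. The function name `consensus_reached` and the list-of-lists matrix are assumptions made for illustration.

```python
def consensus_reached(F_i, i):
    """Return the list of processes s on which process i sees consensus.

    F_i[d][s] is True when, in i's local view, process d has detected s as failed.
    Row d counts towards consensus on s if d has detected s, or if i itself has
    detected d as failed (a failed process cannot contribute a detection).
    """
    n = len(F_i)
    agreed = []
    for s in range(n):
        temp = 0
        for d in range(n):
            if F_i[d][s] or F_i[i][d]:
                temp += 1
        if temp == n:                      # every row is accounted for
            agreed.append(s)
    return agreed

if __name__ == "__main__":
    n = 4
    F = [[False] * n for _ in range(n)]
    for d in (0, 1, 3):                    # alive processes 0, 1, 3 detected process 2
        F[d][2] = True
    print(consensus_reached(F, 0))
```

The row of the failed process itself never contains a detection, which is why the OR with row i is needed: it lets the failed process's row count once process i knows that process has failed.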


Scalable epidemic message passing interface fault tolerance (Soma Sekhar Kolisetty) 


1002 O ISSN: 2302-9285 


At each process i

Requires: Gossip cycle length T_Gossip and timeout period T_out

Matrix initialization:
// F_i[d, s]: status of process s as detected by process d
// Fault matrix F_i[d, s], where 0 ≤ d, s < n
1     for (d = 0; d < n; d++)
2       for (s = 0; s < n; s++)
3         F_i[d, s] = 0     // every process in the system is alive
4       endfor
5     endfor

At every T_Gossip time (at each T_Gossip cycle)
// Failure detection using randomised pinging
6     choose a random process j
7     send a ping message to j piggybacking F_i
8     receive a reply message from j by creating a timeout
      event E_j = <present cycle no + T_out, j>
// check for consensus on process s
9     for (s = 0; s < n; s++)
10      temp = 0
11      for (d = 0; d < n; d++)
12        if (F_i[d, s] || F_i[i, d])
13          temp += 1
14        endif
15      endfor
        // a failed process is identified by all alive processes
16      if (temp == n)
17        consensus achieved on process s
18      endif
19    endfor

At an event: a message received from q piggybacked with F_q
20    if (message == ping)
21      dispatch reply message to process q piggybacking F_i
22    endif
23    for (s = 0; s < n; s++)
24      for (d = 0; d < n; d++)
25        if (d != i)       // transmitting remote failure detection
26          F_i[d, s] = F_i[d, s] || F_q[d, s]
27        else              // indirect local failure detection
28          F_i[i, s] = F_i[i, s] || F_q[q, s]
29        endif
30      endfor
31    endfor                // merging the fault matrices

At an event: no reply message received from j within T_out and timeout event E_j
// (direct failure detection) mark j as failed
32    F_i[i, j] = 1


Figure 3. Failure consensus by global knowledge 
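
To see the four parts of Figure 3 working together, the sketch below runs the whole algorithm in a single Python process instead of over MPI. It is a simplified model under stated assumptions, not the authors' implementation: cycles are synchronous, messages are never lost, all failures are injected before the first cycle, processes take turns within a cycle, and a process never pings itself. The function name `simulate` and its parameters are hypothetical.

```python
import random

def simulate(n, failed, seed=0, max_cycles=1000):
    """Return the first gossip cycle after which every alive process has
    reached consensus on exactly the failed set, or None if max_cycles passes.
    Assumes n >= 2 and at least one alive process."""
    rng = random.Random(seed)
    failed = set(failed)
    alive = [p for p in range(n) if p not in failed]
    # Lines 1-5: every process starts with an all-alive (all False) matrix.
    F = {p: [[False] * n for _ in range(n)] for p in alive}

    def merge(i, q, F_q):
        # Lines 23-31: OR the received matrix into the local one.
        for s in range(n):
            for d in range(n):
                src = F_q[q][s] if d == i else F_q[d][s]
                F[i][d][s] = F[i][d][s] or src

    def agreed(i):
        # Lines 9-19: processes on which i currently sees consensus.
        return {s for s in range(n)
                if all(F[i][d][s] or F[i][i][d] for d in range(n))}

    for cycle in range(1, max_cycles + 1):
        for i in alive:
            j = rng.choice([r for r in range(n) if r != i])   # line 6
            if j in failed:
                F[i][i][j] = True          # line 32: no reply, direct detection
            else:
                merge(j, i, F[i])          # j receives the ping carrying F_i
                merge(i, j, F[j])          # i receives the reply carrying F_j
        if all(agreed(i) == failed for i in alive):
            return cycle
    return None

if __name__ == "__main__":
    for size in (8, 16, 32, 64):
        print(size, simulate(size, {3}, seed=1))
```

The printed cycle counts depend on the seed and on the simplifications listed above, so they are not comparable with the measurements reported in section 3.2; the sketch is only meant to show the control flow.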


3. RESULTS AND DISCUSSION 
3.1. Algorithm analysis 

Epidemic protocols in large and extreme scale distributed systems are based on a peer-to-peer (P2P)
model for computation and decentralised communication. In P2P networks, processes and nodes in dynamic
systems may join and leave the network arbitrarily (a.k.a. node churn [24]) and may fail suddenly. This
affects the robustness and efficiency of epidemic systems [25], and experimenting with a protocol during node
churn is not an easy task. The proposed algorithm can be used to detect failures and achieve consensus using
the MPI framework primitives. The implementation procedure for the proposed algorithm is as follows:
- Generate random processes and iterations
- Inject failures before and during the execution of the algorithm
- Initialise the fault matrix
- MPI_MAX_ERROR_STRING is used for error handling of MPI
- MPI_Comm_size returns the number of processes associated with a communicator
- MPI_Comm_rank returns the rank of the current process in the communicator
- MPI_Barrier synchronises the MPI processes
- MPI_Gather collects local and global consensus information from all the processes
- MPI_Type_contiguous creates a new data type in contiguous memory locations for sending failure injection information
- MPI_Type_commit commits the new data type for communication
- MPI_Send sends the failure injection information to the destination
- MPI_Recv receives the failure injection information from the root
- MPI_Irecv starts (posts) a reception for incoming ping and reply messages
- MPI_Wtime returns an elapsed time in seconds
- MPI_Test tests whether a send or receive operation has completed
- Display the total number of MPI processes that have reached consensus
- Produce the fault matrix after achieving global consensus

At every Gossip cycle, ping-reply communication is used in the algorithm to detect failures
through randomised failure detection. During a Gossip cycle, the number of ping messages an MPI process
receives follows a binomial distribution. For this reason, the probability that a failed process receives no
ping message over several consecutive cycles is very low. Therefore, failed processes are detected in the
earliest Gossip cycles.

Using the same two Gossip messages (ping and reply) in a Gossip cycle, the failure detection
information is sent to all the alive processes across the system. Upon failure detection, the proposed
algorithm achieved consensus in a number of cycles that grows logarithmically with system size. The failure
information dissemination speed is doubled as there are two Gossip messages. Moreover, it is found that
direct failure detection by individual processes increases the dissemination speed and reduces the consensus
time. The algorithm requires 2 Gossip messages in each Gossip cycle. As a result,
the total number of gossip messages needed by the proposed algorithm is:


Gossip messages needed at each process to detect consensus = 2 * Gossip cycles taken 


The fault matrix is stored at each MPI process. Hence, the algorithm demands n² memory units,
where n is the total number of processes in the system. The algorithm is implemented using the basic
communication mechanism of MPI, the point to point communication operations listed in section 3.1. The
fault matrix is implemented as a boolean matrix to increase the scalability of the algorithm. To simulate
failure injection, a process is excluded from further communication. The experiments were performed on a
single desktop workstation running Ubuntu 18.04.1 LTS, Open MPI 4.1.0 and
gcc 7.4.0.
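
To give a feel for why the element type of the n × n fault matrix matters, the sketch below compares the storage needed per process for different entry widths. The widths are assumptions for illustration only (4-byte integers and 1-byte booleans are common in C); the paper does not state which widths its implementation uses, and the helper name `fault_matrix_bytes` is hypothetical.

```python
def fault_matrix_bytes(n, bits_per_entry):
    """Bytes needed for one n x n fault matrix, rounded up to whole bytes."""
    return (n * n * bits_per_entry + 7) // 8

if __name__ == "__main__":
    for n in (32, 1024, 32768):
        as_int32 = fault_matrix_bytes(n, 32)   # assumed 4-byte integer entries
        as_bool = fault_matrix_bytes(n, 8)     # assumed 1-byte boolean entries
        as_bits = fault_matrix_bytes(n, 1)     # bit-packed, one bit per entry
        print(n, as_int32, as_bool, as_bits)
```

Since the whole matrix is piggybacked on every ping and reply, the same factor applies to the size of each gossip message, not only to the memory held at each process.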

Failures were injected right before the execution and during the execution of the algorithm using
epidemic protocols. Consensus is reached on failures as the epidemic protocols spread the information
exponentially. The timeout duration for one Gossip cycle was set to 3 ms for 2° system size. The
cycle length can be varied for a given system size so that the matrix merge operations finish
within the given cycle length. Different gossip cycle lengths are required for different system sizes because
the matrix merge operations take most of the cycle time. The scalability of the matrix and the fault tolerance
were tested through experiments. Pinging the same node more than once is also allowed in the algorithm, to
exploit the redundancy of Gossiping. Failures were injected randomly into the selected MPI processes
while the proposed algorithm was running. Each process reaches consensus on the injected failures at a
different cycle number, so the total number of consensus events reached by each process is recorded. The
aforementioned facts have all been verified through experiments.




3.2. Results 

Figure 4(a) shows the number of gossip cycles taken by each MPI process to achieve global consensus
after a single failure injected before the algorithm execution. Figure 4(b) shows the gossip cycles
taken to reach global consensus for a failure injected during the algorithm execution. In both cases, it is
observed that the gossip cycle count needed to reach global consensus increases with system size. Figure 5
shows the percentage of failure detection information spread at each MPI process after failures are injected.
Figure 6(a) and Figure 6(b) show the injection of multiple failures at random cycles before and during the
execution of the algorithm, and the cycles taken to reach global consensus. Eight failures were injected right
before and during the proposed algorithm, which demonstrates its fault tolerance. It is also noticed that the
gossip cycle count taken to reach global consensus increased slightly. Table 1 displays the gossip cycle
number at which each MPI process achieved consensus for a single failure injected before and after execution
of the proposed algorithm, and for multiple (eight) failures injected before and after execution of the
proposed algorithm.
Figure 4. Cycle number at which each process has reached consensus after single failure injection (a) before
algorithm execution and (b) during algorithm execution 


[Figure 5: plot of consensus percentage against gossip cycle number]


Figure 5. Consensus detected locally after failure injection 
Figure 6. Cycle number at which each process has reached consensus after multiple (eight) failures injection
(a) before algorithm execution and (b) during algorithm execution


Table 1. The statistical analysis of consensus achievement

                 Single failure                                       Multiple failures
Gossip cycle #   Before execution   After execution    Gossip cycle #   Before execution   After execution
                 Process #          Process #                           Process #          Process #
2                10, 11             -                  4                10-14              -
3                12-15              10-17              5                15-20              10-14
4                16-23              18-25              6                21-26              15-20
5                24-31              26-31              7                27-31              21-31


4. CONCLUSION


This paper presented a Gossip style, matrix based failure detection and consensus algorithm that
uses randomised pinging to detect MPI process failures. The algorithm is completely fault tolerant as it
works even in the presence of single or multiple MPI process failures. Consensus is achieved on failed
processes based on global knowledge: every process in the system maintains the status of all other
processes. The proposed algorithm occupies more memory as every process uses a fault matrix of size O(n²),
where n indicates the total process count in the system. Failures were detected using a Gossip style protocol
and disseminated throughout the system using the same Gossip messages. Experiments were run
on a workstation personal computer with a power-of-two number of MPI processes. The scalability of the
algorithm was improved by using boolean values in the fault matrix at each MPI process.